11. Split Dataset for Model Development Exercise

Split Dataset For Model Development Exercise

In this exercise, we're going to prepare a dataset to be fed into a classification model for training. You're given a dataframe with image labels for 5,000 chest x-rays. Your goal is to prepare a training and a validation dataset for an algorithm that predicts the presence of a Pneumothorax (collapsed lung). Remember, we want our model to see an equal amount of positive and negative cases when it's training, but when we evaluate its performance, we should be looking at a class balance or imbalance that is more reflective of the real world. In this exercise,

You will notice that Pneumothorax isn't a very common finding in this dataset, so you'll want to maximize your data so that you can use all of the true Pneumothorax cases in training. Given the large class imbalances, however, you may end up throwing away images that don't contain Pneumothorax.

Here's an assumption you can make when creating your validation set: Despite the large imbalance of Pneumothorax in this dataset, in the actual clinical setting where you want to deploy your algorithm, the prevalence of Pneumothorax will be about 20%. This is because patients are only being x-rayed based on their clinical symptoms that make Pneumothorax highly likely.

The final output of this exercise should be two dataframes: one containing data for training your algorithm, and one containing data for validating your algorithm.

Code

If you need a code on the https://github.com/udacity.